{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "_X9GuXoSXleA"
   },
   "source": [
    "<center>\n",
    "    <p style=\"text-align:center\">\n",
    "        <img alt=\"phoenix logo\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg\" width=\"200\"/>\n",
    "        <br>\n",
    "        <a href=\"https://arize.com/docs/phoenix/\">Docs</a>\n",
    "        |\n",
    "        <a href=\"https://github.com/Arize-ai/phoenix\">GitHub</a>\n",
    "        |\n",
    "        <a href=\"https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email\">Community</a>\n",
    "    </p>\n",
    "</center>\n",
    "<h1 align=\"center\">Active Learning for a Drifting Image Classification Model</h1>\n",
    "\n",
    "Imagine you're in charge of maintaining a model that classifies the action of people in photographs. Your model initially performs well in production, but its performance gradually degrades over time.\n",
    "\n",
    "Phoenix helps you surface the reason for this regression by analyzing the embeddings representing each image. Your model was trained on crisp and high-resolution images, but as you'll discover, it's encountering blurred and noisy images in production that it can't correctly classify.\n",
    "\n",
    "In this tutorial, you will:\n",
    "\n",
    "- Download curated datasets of embeddings and predictions\n",
    "- Define a schema to describe the format of your data\n",
    "- Launch Phoenix to visually explore your embeddings\n",
    "- Investigate problematic clusters\n",
    "- Export problematic production data for labeling and fine-tuning\n",
    "\n",
    "Let's get started!\n",
    "\n",
    "## Install Dependencies and Import Libraries\n",
    "\n",
    "Install Phoenix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install -Uq \"arize-phoenix[embeddings]\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Import libraries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from IPython.display import HTML, display\n",
    "\n",
    "import phoenix as px"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "OFeF5_Bysd2f"
   },
   "source": [
    "## Download and Inspect the Data\n",
    "\n",
    "Download production and training image data containing photographs of people performing various actions (sleeping, eating, running, etc.)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_df = pd.read_parquet(\n",
    "    \"http://storage.googleapis.com/arize-phoenix-assets/datasets/unstructured/cv/human-actions/human_actions_training.parquet\"\n",
    ")\n",
    "prod_df = pd.read_parquet(\n",
    "    \"http://storage.googleapis.com/arize-phoenix-assets/datasets/unstructured/cv/human-actions/human_actions_production.parquet\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "View a few training data points."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The columns of the dataframe are:\n",
    "- **prediction_id:** a unique identifier for each data point\n",
    "- **prediction_ts:** the Unix timestamps of your predictions\n",
    "- **url:** a link to the image data\n",
    "- **image_vector:** the embedding vectors representing each image\n",
    "- **actual_action:** the ground truth for each image\n",
    "- **predicted_action:** the predicted class for the image\n",
    "\n",
    "View a few production data points."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "prod_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice that the production data is missing ground truth, i.e., has no \"actual_action\" column.\n",
    "\n",
    "Display a few images alongside their predicted and actual labels. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def display_examples(df):\n",
    "    \"\"\"\n",
    "    Displays each image alongside the actual and predicted classes.\n",
    "    \"\"\"\n",
    "    sample_df = df.reindex(columns=[\"actual_action\", \"predicted_action\", \"url\"]).rename(\n",
    "        columns={\"url\": \"image\"}\n",
    "    )\n",
    "    html = sample_df.to_html(\n",
    "        escape=False, index=False, formatters={\"image\": lambda url: f'<img src=\"{url}\">'}\n",
    "    )\n",
    "    display(HTML(html))\n",
    "\n",
    "\n",
    "display_examples(train_df.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Launch Phoenix\n",
    "\n",
    "Define a schema to tell Phoenix what the columns of your training dataframe represent (features, predictions, actuals, tags, embeddings, etc.). See the [docs](https://arize.com/docs/phoenix/) for guides on how to define your own schema and API reference on `phoenix.Schema` and `phoenix.EmbeddingColumnNames`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_schema = px.Schema(\n",
    "    timestamp_column_name=\"prediction_ts\",\n",
    "    prediction_label_column_name=\"predicted_action\",\n",
    "    actual_label_column_name=\"actual_action\",\n",
    "    embedding_feature_column_names={\n",
    "        \"image_embedding\": px.EmbeddingColumnNames(\n",
    "            vector_column_name=\"image_vector\",\n",
    "            link_to_data_column_name=\"url\",\n",
    "        ),\n",
    "    },\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The schema for your production data is the same, except it does not have an actual label column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "prod_schema = px.Schema(\n",
    "    timestamp_column_name=\"prediction_ts\",\n",
    "    prediction_label_column_name=\"predicted_action\",\n",
    "    embedding_feature_column_names={\n",
    "        \"image_embedding\": px.EmbeddingColumnNames(\n",
    "            vector_column_name=\"image_vector\",\n",
    "            link_to_data_column_name=\"url\",\n",
    "        ),\n",
    "    },\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create Phoenix datasets that wrap your dataframes with schemas that describe them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "prod_ds = px.Inferences(dataframe=prod_df, schema=prod_schema, name=\"production\")\n",
    "train_ds = px.Inferences(dataframe=train_df, schema=train_schema, name=\"training\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Launch Phoenix. Follow the instructions in the UI to open the Phoenix UI."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "(session := px.launch_app(primary=prod_ds, reference=train_ds)).view()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Find and Export Problematic Clusters\n",
    "\n",
    "Click on \"image_embedding\" in the \"Embeddings\" section.\n",
    "\n",
    "![click on image embedding](http://storage.googleapis.com/arize-phoenix-assets/assets/docs/notebooks/image-classification/click_on_image_embedding.png)\n",
    "\n",
    "Select a period of high drift in the Euclidean distance graph at the top.\n",
    "\n",
    "![select period of high drift](http://storage.googleapis.com/arize-phoenix-assets/assets/docs/notebooks/image-classification/select_period_of_high_drift.png)\n",
    "\n",
    "Click on the top cluster in the panel on the left. Phoenix has identified this cluster as problematic because it consists entirely or almost entirely of production data, meaning that your model is making production inferences on data the likes of which it never saw during training.\n",
    "\n",
    "![select top cluster](http://storage.googleapis.com/arize-phoenix-assets/assets/docs/notebooks/image-classification/select_top_cluster.png)\n",
    "\n",
    "Use the panel at the bottom to examine the data points in this cluster. What do you notice about these data points that is different from the training data points you saw earlier?\n",
    "\n",
    "![inspect points in cluster](http://storage.googleapis.com/arize-phoenix-assets/assets/docs/notebooks/image-classification/inspect_points_in_cluster.png)\n",
    "\n",
    "The data points in the cluster above are grainy and noisy. Click on the \"Export\" button to save your cluster for relabeling and fine-tuning.\n",
    "\n",
    "![export cluster](http://storage.googleapis.com/arize-phoenix-assets/assets/docs/notebooks/image-classification/export_cluster.png)\n",
    "\n",
    "## Load and View Exported Data\n",
    "\n",
    "View the exported cluster as a dataframe in your notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "export_df = session.exports[-1]\n",
    "export_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Display a few examples from your exported data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "display_examples(export_df.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Congrats! You've pinpointed the blurry or noisy images that are hurting your model's performance in production. As an actionable next step, you can label your exported production data and fine-tune your model to improve performance."
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
